Extracellular matrix profiles determine risk and prognosis of the squamous cell carcinoma subtype of non-small cell lung carcinoma

Background Squamous cell carcinoma (SqCC) is a subtype of non-small cell lung cancer for which patient prognosis remains poor. The extracellular matrix (ECM) is critical in regulating cell behavior; however, its importance in tumor aggressiveness remains to be comprehensively characterized. Methods Multi-omics data of SqCC human tumor specimens was combined to characterize ECM features associated with initiation and recurrence. Penalized logistic regression was used to define a matrix risk signature for SqCC tumors and its performance across a panel of tumor types and in SqCC premalignant lesions was evaluated. Consensus clustering was used to define prognostic matreotypes for SqCC tumors. Matreotype-specific tumor biology was defined by integration of bulk RNAseq with scRNAseq data, cell type deconvolution, analysis of ligand-receptor interactions and enriched biological pathways, and through cross comparison of matreotype expression profiles with aging and idiopathic pulmonary fibrosis lung profiles. Results This analysis revealed subtype-specific ECM signatures associated with tumor initiation that were predictive of premalignant progression. We identified an ECM-enriched tumor subtype associated with the poorest prognosis. In silico analysis indicates that matrix remodeling programs differentially activate intracellular signaling in tumor and stromal cells to reinforce matrix remodeling associated with resistance and progression. The matrix subtype with the poorest prognosis resembles ECM remodeling in idiopathic pulmonary fibrosis and may represent a field of cancerization associated with elevated cancer risk. Conclusions Collectively, this analysis defines matrix-driven features of poor prognosis to inform precision medicine prevention and treatment strategies towards improving SqCC patient outcome. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01127-6.

gene expression analysis was performed using the voom algorithm (default settings, Limma package (16)). Heatmaps of gene expression matrices were generated using the ComplexHeatmap package (17) for visualization.

Matrix Risk Signature Generation
Risk signature feature selection was performed on differentially expressed core matrisomal genes between tumor and non-tumor tissue in the TCGA LUSC dataset.
Genes were selected based on their association with tumor compared with non-tumor tissue using a penalized logistic regression implemented using the glmfit function from the glmnet package (18), to account for the high degree of correlation of some matrix genes. Z-scaled RNAseq expression data were initially partitioned 80% to 20% into training and test datasets, respectively using the createDataPartition algorithm of the caret package and the model was developed using training data only. Penalized logistic regression was implemented to refine the matrix risk signature to include only those matrisomal genes that most significantly discriminated between tumor and non-tumor tissue while ensuring that the model was not over-fitted to the data. The penalized logistic regression model was developed on the training dataset and was considered most appropriate due to the large variance in the ordinary least squares estimator due to some highly correlated matrix genes. The coefficient of shrinkage (lambda) was calculated to balance the number of predictors against accuracy and was chosen as the value within one standard error of the lambda that minimized the cross-validation prediction error rate, implemented using the cv.glmnet algorithm. Alpha was set to 0.5 to apply Elastic Net regression. Elastic net regression was selected over lasso regression as elastic net regression balances the minimization and elimination of poorly -predicting variables, while lasso regression eliminates variables and can remove too many variables, causing poor generalization of the model (18). As a result, elastic net regression retains as many ECM genes in the signature required to accurately fit the data, with more ECM genes to enable the model to perform well when applied to independent datasets. The final model was applied to the test data where it gave 100% accuracy of predicting tumor and non-tumor tissues. To generate a matrix risk score and for visualization purposes, Odds ratios were calculated using Firth's correction using the logistf algorithm from the logistf package. The matrix risk score for each sample was summed as the product of the log(Odds Ratio) (Supplementary Table 1

Matreotype Identification
Identification of matreotypes was performed using monte-carlo reference-based consensus clustering implemented through the M3C command using the K-means clustering algorithm (default settings, M3C package (19)) applied to the TCGA expression matrix of significantly differentially expressed core matrisomal genes from tumor and non-tumor tissue. We used this algorithm to accurately determine the optimal cluster number and avoid overfitting of the data by simulating a null distribution of stability scores for a range of cluster numbers (K) and correcting the inherent bias of consensus clustering towards high K values. Cluster number was determined as the number of clusters to give a maximal Relative Cluster Stability Index (RCSI), a Monte -Carlo p-value less than 0.05 and a minimal Proportional of Ambiguous Clustering (PAC) Score (19). Cluster assignments were manually checked by Principal Component Analysis (PCA). Clustering statistics (silhouette width, Dunn index) were calculated by the cluster.stats algorithm of the fpc package.
Centroids for each matreotype were calculated on the TCGA LUSC dataset as the mean z-scaled expression level for those genes used to identify the matreotypes (i.e. significantly differentially expressed core matrisomal genes between tumor and nontumor tissue). Matreotypes were assigned as the matreotype that gave the minimal Euclidean distance between each sample and the matreotype centroids. The association of matreotypes with categorical clinicodemographic information was assessed using the Fisher's exact test. Survival associations were assessed using the Survival and Survminer packages. Hazard ratios were calculated using the coxph function from the Surv package with corrections for stage and age, which were clinical covariates significantly associated with survival in univariate analysis. Gender, smoking status and pack years were clinical covariates not found to be significantly associated with survival in a univariate analysis so were not included in the multivariate model.
Matreotype associations with driver mutations were assessed using maftools. FGFR copy number variation analysis was performed on TCGA CNV data using probes mapped to the segment containing the FGFR cluster at chromosome 8p11. 23 (20)). Differential enrichment of these pathway activity estimates in the different matreotypes were tested by fitting a linear model (lmFit algorithm with default settings, limma package) and calculation of differential expression by empirical Bayes moderation of the standard error to a global value (eBayes algorithm with default settings, limma package) (16).

Ligand Receptor Interaction Analysis
Genes within the RNAseq datasets were annotated as ligands and receptors based on the curated database of human ligand-receptor pairs previously published by  (14) were retained for further analysis. We used this database to identify receptors that the core matrisomal components interact with to influence cell signaling processes within the tumor microenvironment. The strength of these ligand receptor interactions were calculated as an interaction score, based on the expression level of the core matrisomal ligand and its cognate receptor, as described previously (22). Mathematically, the interaction score of each ligand-receptor pair was calculated as the product of expression values from the ligand (i.e. core matrisomal gene) and its cognate receptor in each sample, (Equation

Cellular Composition Analysis
Two main approaches were implemented to identify enriched cell types in SqCC matreotypes using datasets described in Table 1

Single Cell RNAseq Analysis of Squamous Cell Carcinoma
Raw gene expression matrices and cellular metadata including cell type assignments from scRNAseq data of NSCLC samples were obtained from (24,26) and processed as described. Briefly, scRNAseq expression processing was performed in the Seurat package (v4.0.1). Cells with fewer than 201 UMIs and over 6000 or below 101 expressed genes, or over 10% UMIs derived from the mitochondrial genome were removed. Gene expression matrices were normalized to total cellular read count and mitochondrial read count as implemented by Seurat's normalize and scale functions.
Variably expressed genes were selected as having a normalized expression between 0.125 and 3 and variance exceeding 0.5. To reduce dimensionality of the dataset, variably expressed genes were summarized by Principal Component Analysis (PCA) and the first 8 principal components were further summarized using tSNE dimensionality reduction using the default settings of the RunTSNE function. Cell clusters were then annotated using the cluster annotations assigned by (24). Cells corresponding to SqCC tumors were retained for further analysis. Cell type scores for ECM genes were assigned using the AddModuleScore function (Seurat package, default settings). Matrix risk scores were calculated on the z-scaled gene expression matrix for each cell type using Equation 1 as described above (Matrix Risk Signature section).

Fibrosis Score
In order to characterize the extent to which SqCC ECM remodeling overlaps with that in Idiopathic Fibrosis we calculated an IPF fibrosis score for each TCGA sample based on the expression of differentially expressed genes in in human IPF lungs compared

Transcription Factor Enrichment Analysis
Transcription factor analysis on core matrisomal correlation clusters was performed using the chEA transcription factor targets database using the chEA3 web interface (28). which mines a database of CHIP-based experimental results and scores gene sets for transcription factor associations based on ENCODE Chip-Seq, publicly available CHIP-Seq data describing binding site proximity as well as the co-expression of genes in the GTeX and ARCHS4 datasets. As the transcription factor regulation of many ECM genes have not yet been comprehensively characterized, integrating multiple CHIP databases with expression analysis in this approach expands the scope for identifying upstream regulators of ECM gene expression. Gene lists corresponding to each core matrisomal correlation cluster were uploaded to the chEA3 server and the integrated mean rank for each transcription factor was calculated for each correlation cluster.

Picrosirius Red Staining and Quantitation of Tissue Microarrays
Each core of the tissue microarray was assessed for tumor tissue by a registered clinical The total red area was normalized to the total tissue area to account for spaces within the tissue corresponding to vessel and airway lumens.
To assess the association of picrosirius red staining with survival, the maximum total red signal normalised to tissue area across three to five tumor cores was calculated per patient. Patients were stratified into high and low groups by the 60 th percentile. This    Figure 3A). D) Heatmap of marker genes for the ECM-High and ECM-Low matreotypes in the TCGA cohort. Multivariate cox-proportional hazards models include stage as a clinical covariate. Hazard ratios and confidence intervals are in Additional File 1: Table S8. Additional Tables   Additional File 1: Table S1. Matrix Risk Score